Abstract. We present a domain-oriented numerical summarization algorithm,
applied to producing automatic summaries of scientific documents in
Organic Chemistry. Its implementation, named YACHS (Yet Another
Chemistry Summarizer), combines a specific document pre-processing
with a sentence scoring method relying on the statistical properties
of documents. We show that YACHS achieves the best results among
several summarizers on a corpus of Organic Chemistry articles.

1 Introduction 

Over 1.7 million new Chemistry articles were published in 2007 (see footnote 3);
as a result, most scientists today face information overload. Information
extraction technology arose in response to the need for efficient processing of
documents in specialized domains. Scientists, especially chemists, want prompt
access to the information contained in a document beyond the author's abstract,
which is often too concise or unsatisfying. Automatically producing summaries of
Organic Chemistry documents is a challenging but critical task for chemical
information retrieval. Text summarization is the process of distilling the most
important information from a source (or sources) to produce an abridged version
for a particular user and task [1]. Text summarization has many uses in
everyday activities: we are familiar with summaries such as headlines, reviews
or digests. Introduced by Luhn [2] in the late 1950s, text summarization was
first characterized by a surface-level approach (i.e. exploiting term
frequencies). The first entity-level approaches based on syntactic analysis appeared
in the early 1960s [3], while the use of location features and cue phrases was
not developed until later [4]. The investigations reported by [5] at the Chemical



3 See Chemical Abstracts Publication Record, http://www.cas.org/ 



Abstracts Service (CAS) provide further insight into the effectiveness of
automatic summarization in particular domain areas. Corpus-based approaches were
introduced by [6] with a trainable summarization system using a collection of
text/summary pairs as a training set. A Bayes classifier takes each sentence
and, based on features such as cue phrases, sentence length or location,
computes the probability that it should be included in the summary. Thereafter, [7]
extended this model using decision tree rules instead of Bayesian classifiers.
Rhetorical status was proposed by [8] to summarize scientific articles
(Computational Linguistics conference articles) in a way that highlights the new
contribution of the source article. The limitation of this approach is that it
depends on manual resources (metadiscourse features are manually annotated).
[9] proposes to combine semantic-based and frequency-distribution approaches for
extractive text summarization of biomedical documents. However, this approach
requires a difficult concept identification process. The benefits of automatic
abstracting are now clearly identified: it is inexpensive compared to human effort
and, unlike human abstractors, it is consistent and avoids subjectivity and
variability. Typically, summarization systems are two-phased, consisting of a content
selection step followed by a generation step. First, text fragments (most often
sentences) are assigned a score that reflects how important they are. The
highest-ranking material can then be arranged and displayed as "extracts". This
paper presents YACHS (Yet Another Chemistry Summarizer), a summarization
system that generates extracts from scientific articles in a specialized domain,
Organic Chemistry. The motivation behind this work is to allow non-expert users
to access information contained in high-end scientific documents by dynamically
generating extracts. Specifically, through statistical entity-level approaches, we
seek to produce highly informative extracts that can stand in place of the original
author's abstracts as surrogates.



2 Method 

2.1 Pre-processing 

The first question we are concerned with is whether classical Natural Language
Processing (NLP) tools are effective within the organic chemistry domain. The
answer is clearly no. Tools such as parsers, taggers or chunkers perform very poorly
on these documents unless they undergo a strenuous, costly and often manual
adaptation phase. The issues encountered by classical tools are due to domain
specificity: a very wide vocabulary, long sentences containing noise (e.g.
citations, chemical formulas, tables, picture references, etc.), a high quantity
of hapax legomena (see footnote 4), etc.

The basic idea is to represent the document within the vector space model
introduced by [10] and apply specific numeric treatments to select the most salient
sentences. An n-dimensional term-space r, where n is the number of different
terms found in the document, is constructed. One convenient way to represent



4 Terms which appear only once in a document.



the document in r is a matrix $M = [a_{x,y}]_{x=1 \ldots m;\; y=1 \ldots n}$, where m is the number
of sentences and n the number of different terms. In this interpretation, every
row of M is a vector $s_x$ representing the sentence x, in which each component is
the frequency of a term within the sentence.
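
As an illustration, the construction of M can be sketched in a few lines of Python. This is a minimal sketch and not the authors' implementation: the function name `build_term_matrix` is hypothetical, and sentences are assumed to be already pre-processed lists of terms.

```python
from collections import Counter

def build_term_matrix(sentences):
    """Return (vocabulary, M) where M[x][y] is the frequency of term y in sentence x."""
    # Each sentence is assumed to be an already pre-processed list of terms.
    vocabulary = sorted({term for sentence in sentences for term in sentence})
    column = {term: y for y, term in enumerate(vocabulary)}
    matrix = []
    for sentence in sentences:
        row = [0] * len(vocabulary)
        for term, count in Counter(sentence).items():
            row[column[term]] = count
        matrix.append(row)
    return vocabulary, matrix

# Two toy sentences (m = 2) over five distinct terms (n = 5).
sentences = [
    ["cycloalkynes", "know", "isomer", "1,2-dienes"],
    ["1,2-dienes", "react", "1,2-dienes"],
]
vocabulary, M = build_term_matrix(sentences)
```

Sorting the vocabulary only fixes a reproducible column order; any consistent ordering of the n terms would serve equally well.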

In order to reduce the size of the matrix M, and accordingly the computational
complexity, some reductions and filtering are applied to sentences
(see Table 1). In written language, some words carry more meaning than others.
Therefore, a stop-word elimination phase is performed (1) to delete
non-representative words (i.e. words such as 'the', 'of', 'in'... are removed). A standard
pre-processing would then normalize character case and remove punctuation and
special characters (2).

However, important information about chemical compounds may be lost during
the filtering process (e.g. '1,2-dienes' is transformed into 'dienes'). Moreover, if
word normalization (i.e. stemming, see footnote 5) is applied afterwards (3),
erroneous information is introduced into the sentence (e.g. '1,2-dienes' is
transformed into 'dien'). We propose to perform a chemical compound detection to
protect these terms during the normalization process (2'). Finally, stemming is
performed only on non-chemical terms (3').



Step      Sentence
Original  Cycloalkynes are known to isomerize to the 1,2-dienes under basic conditions.
(1)       Cycloalkynes known isomerize 1,2-dienes under basic conditions.
(2)       cycloalkynes known isomerize dienes under basic conditions
(3)       cycloalkyn know isomer dien under basic condit
(2')      cycloalkynes know isomerize 1,2-dienes under basic conditions
(3')      cycloalkynes know isomer 1,2-dienes under basic condit

Table 1. Example of sentence pre-processing.
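
The protected pipeline, steps (1), (2') and (3'), can be sketched as follows. This is a sketch under stated assumptions: the stop list is a tiny illustrative one, `looks_chemical` is a hypothetical stand-in for the chemical compound detector, and `stem` defaults to the identity function where the paper uses the Porter stemmer, so the output below is not stemmed.

```python
import re

# Tiny illustrative stop list; the list used by the authors is not given.
STOP_WORDS = {"are", "to", "the", "of", "in", "a", "is"}

def preprocess(sentence, is_chemical, stem=lambda word: word):
    """Apply steps (1), (2') and (3') to one sentence and return its terms."""
    terms = []
    for token in sentence.split():
        token = token.strip(".,;:()").lower()
        if not token or token in STOP_WORDS:
            continue  # step (1): stop-word elimination
        if is_chemical(token):
            terms.append(token)  # step (2'): chemical term kept intact
            continue
        cleaned = re.sub(r"[^a-z]", "", token)  # step (2): strip non-letters
        if cleaned:
            terms.append(stem(cleaned))  # step (3'): stem non-chemical terms only
    return terms

# Hypothetical detector used only for this demonstration.
def looks_chemical(token):
    return token[0].isdigit() or token.startswith("cyclo")

terms = preprocess(
    "Cycloalkynes are known to isomerize to the 1,2-dienes under basic conditions.",
    looks_chemical,
)
```

Calling `preprocess` with a detector that always answers no reproduces the loss described above: '1,2-dienes' is reduced to 'dienes'.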



Chemical compounds are detected within sentences using a combination of
two classifiers. The first one is a Bayes classifier trained on letter 3-grams,
whereas the second one uses pattern matching with a small number of manually
written rules (7 rules). Each sentence is tokenized into words and each word is
classified by the two classifiers. Precision is prioritized by using the AND
combination (i.e. a word has to be classified as a chemical compound by both
classifiers). This hybrid approach (statistical and symbolic) to chemical term
recognition achieves very good results on a test corpus composed of Organic
Chemistry articles [12].
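
The AND combination can be illustrated with a toy version of both classifiers. Everything below is an assumption made for the sake of the example: the training words, the add-one smoothing and the two regular expressions are hypothetical, since neither the training data nor the seven hand-written rules are listed in the paper.

```python
import math
import re
from collections import Counter

def trigrams(word):
    # Pad the word so that prefixes and suffixes form their own 3-grams.
    padded = f"^^{word.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

class TrigramBayes:
    """Naive Bayes over letter 3-grams with add-one smoothing (toy-sized)."""

    def __init__(self, chemical_words, other_words):
        self.counts = {True: Counter(), False: Counter()}
        for label, words in ((True, chemical_words), (False, other_words)):
            for word in words:
                self.counts[label].update(trigrams(word))
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        total = len(chemical_words) + len(other_words)
        self.prior = {True: len(chemical_words) / total,
                      False: len(other_words) / total}

    def log_score(self, word, label):
        counts = self.counts[label]
        denom = sum(counts.values()) + len(self.vocab) + 1
        score = math.log(self.prior[label])
        for gram in trigrams(word):
            score += math.log((counts[gram] + 1) / denom)
        return score

    def is_chemical(self, word):
        return self.log_score(word, True) > self.log_score(word, False)

# Hypothetical rules: the paper's seven hand-written rules are not listed.
RULES = [
    re.compile(r"^\d+(,\d+)*-"),                        # locant prefix such as "1,2-"
    re.compile(r"(ene|yne|ane|ol|one|ium)s?$", re.I),   # a few common suffixes
]

def matches_rule(word):
    return any(rule.search(word) for rule in RULES)

def is_chemical_term(word, bayes):
    # AND combination: both classifiers must agree, which favours precision.
    return bayes.is_chemical(word) and matches_rule(word)

# Toy training data, for illustration only.
bayes = TrigramBayes(
    ["cycloalkyne", "1,2-diene", "iodonium", "benzene", "methanone"],
    ["under", "basic", "conditions", "known", "provide", "reaction"],
)
```

With this toy model, `is_chemical_term("1,2-dienes", bayes)` holds while `is_chemical_term("conditions", bayes)` does not. Requiring agreement trades recall for precision: a word accepted by only one classifier is treated as non-chemical and goes through ordinary normalization.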

2.2 Sentence Ranking 

Once sentences are pre-processed, a combination of features (also called metrics) 
is used to assign a score to each sentence. That score reflects how important the 

5 The Porter stemmer algorithm [11] is used to normalize words by removing the
commoner morphological and inflexional endings from words.






sentences are in relation to the whole document. The main advantage of this
approach is that no prior knowledge is required, which makes the system fully
adaptable to any language and/or domain. This section formally describes the
metrics calculated by YACHS.

Authors normally conceive titles as circumscribing the topic of the document.
Sentences that share words with the title, or contain words related or similar
to it, are likely to be relevant. Following this assumption, two metrics computing
similarity measures between a sentence and the title have been implemented. The
first measure is the well-known cosine of the angle [10] between the vector
representations of a sentence and the title in r. The main weakness of cosine,
and more generally of all similarity measures using words as tokens, is that they
rely too much on term normalization. Their performance decreases dramatically with
wrongly normalized or non-normalized words. We propose a second similarity measure
based on the Jaro-Winkler distance [12] that can bridge morphologically similar
words in order to smooth normalization and misspelling errors. The original
Jaro-Winkler measure, denoted Jw, uses the number of matching characters and
transpositions to compute a similarity score between two terms, giving more
favourable ratings to terms that match from the beginning (see examples in
Table 2). We have extended this measure to calculate the similarity between a
sentence $s_x$ and the title t (see Table 3):

$$Jw_e(s_x, t) = \frac{1}{|t|} \sum_{w_t \in t} \max_{w_x \in S'} Jw(w_t, w_x) \qquad (1)$$

where S' is the term set of $s_x$ from which the words $w_x$ that have already maximized
$Jw(w_t, w_x)$ are removed.
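
The extended measure can be sketched on top of the textbook Jaro-Winkler definition. Several points are assumptions here: the sum is normalized by the number of title terms, title words are processed in their order of appearance, and the prefix scale of 0.1 over at most four characters is the usual default. The values of Table 2 come from the authors' own implementation and are not necessarily reproduced by this textbook variant.

```python
def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    matched_a = [False] * len(a)
    matched_b = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not matched_b[j] and b[j] == ch:
                matched_a[i] = matched_b[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in a and b.
    k = transpositions = 0
    for i, ch in enumerate(a):
        if matched_a[i]:
            while not matched_b[k]:
                k += 1
            if ch != b[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

def jaro_winkler(a, b, prefix_scale=0.1):
    score = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    # Terms that match from the beginning receive a bonus.
    return score + prefix * prefix_scale * (1 - score)

def jw_extended(sentence_terms, title_terms):
    remaining = list(sentence_terms)  # S': sentence words not yet matched
    total = 0.0
    for title_word in title_terms:
        if not remaining:
            break
        best = max(remaining, key=lambda w: jaro_winkler(title_word, w))
        total += jaro_winkler(title_word, best)
        remaining.remove(best)  # a sentence word can maximize only once
    return total / len(title_terms) if title_terms else 0.0
```

Removing each best match from S' prevents a single sentence word from being counted against several title words, so long sentences that repeat one title-like term are not over-rewarded.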



Word 1                  Word 2                  Jw
nucleophile             nucleophilic            0.94515
nucleophile             electrophile            0.47643
diphenyl                1,1-Diphenylmethanone   0.35516
1,1-Diphenylmethanone   nucleophile             0.11038

Table 2. Examples of Jaro-Winkler distance (Jw) between words.



Experiments have shown that sentence position within the document is a
very important feature [1]. Indeed, information is not homogeneously spread
across the document but is arranged by the author according to widely
accepted writing rules. Document beginnings and endings usually contain sentences
that are highly relevant because their purpose is to present and sum
up the topic. Sentence position is therefore used as a metric, denoted P (equation
2), computed as a smoothing parabola that depends on the number of sentences
m in the document.



Title        Generation of Cycloalkynes by Hydro-Iodonio-Elimination of Vinyl Iodonium Salts
Sentence     Cycloalkylidenecarbene can provide a ring-expanded cycloalkyne via 1,2-rearrangement.
Tpreproc.    general cycloalkynes hydro-iodonio-elimination vinyl iodonium salt
Spreproc.    cycloalkylidenecarbene provid ring expand cycloalkyne via rearrang
cosine       (no co-occurrences)
Jw_e         0.43348

Table 3. Example of similarity measures between the title and a sentence
(Tpreproc. and Spreproc. are the pre-processed title and the pre-processed sentence).



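[Equation (2): piecewise definition of the position metric P, with one case when m is even and another otherwise; the formula is not legible in the source.]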

We have implemented four other metrics relying on numerical treatments;
they are computed on the matrix M (previously introduced in Section 2.1). The
first one is the sum of word frequencies, denoted F (equation 3), which uses the
frequencies of words in sentences. Sentences that contain important words
are considered relevant.

$$F_x = \sum_{y=1}^{n} a_{x,y} \qquad (3)$$

The second metric, denoted C (equation 4), relies on the number of chemical
compounds detected in the sentence, giving a penalty to sentences that do not
contain any chemical compounds.

$$C_x = \begin{cases} 1 & \text{if } x \text{ contains at least one chemical compound} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

The third metric, denoted I (equation 5), represents the interaction relationship
between sentences. The underlying idea is that sentences containing words
that are used in other sentences are statistically more representative of the
document [13].

$$I_x = \sum_{\substack{y=1 \\ a_{x,y} \neq 0}}^{n} \; \sum_{z=1}^{m} a_{z,y} \qquad (5)$$
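
Both F and I can be read directly off the matrix M, as the following sketch shows. The summand of the interaction metric is an assumption: for each term present in sentence x, the sketch adds that term's total frequency over all m sentences, including sentence x itself. A variant restricted to the other sentences would subtract $a_{x,y}$ from each column total.

```python
def frequency_metric(M):
    # F_x: sum of the term frequencies in row x (equation 3).
    return [sum(row) for row in M]

def interaction_metric(M):
    # Column totals: how often each term occurs over all sentences.
    column_totals = [sum(col) for col in zip(*M)]
    # I_x: add up the totals of the terms that sentence x actually contains.
    return [
        sum(total for a_xy, total in zip(row, column_totals) if a_xy != 0)
        for row in M
    ]

# Toy matrix with m = 3 sentences and n = 3 terms.
M = [
    [1, 1, 0],
    [2, 0, 1],
    [0, 0, 3],
]
F = frequency_metric(M)
I = interaction_metric(M)
```

In this toy matrix the second sentence obtains the highest interaction score because both of its terms also occur in other sentences.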

The last metric, denoted H (equation 6), is the sum of the Hamming distances
computed on the pairs of words of a sentence. The idea is to give more weight to
pairs of words that appear independently in sentences. Synonyms and topic-related
words generally receive a high weight according to the Hamming distance.
In order to compute this metric, a second matrix, denoted $M_h$, is constructed
from M. $M_h$ is an n x n triangular matrix constructed from word co-occurrences



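between sentence pairs:

$$M_h = [h_{i,j}]_{i=1 \ldots n;\; j=1 \ldots n}, \qquad h_{i,j} = \sum_{x=1}^{m} \begin{cases} 1 & \text{if } a_{x,i} \neq a_{x,j} \\ 0 & \text{otherwise} \end{cases}$$

$$H_x = \sum_{i=1}^{n} \sum_{j=1}^{n} \begin{cases} h_{i,j} & \text{if } a_{x,i} \neq 0 \text{ and } a_{x,j} \neq 0 \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$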




Sentences are scored using an equiprobable (i.e. equally weighted) linear
combination (see footnote 6) of the normalized metrics (i.e. ranged in [0, 1])
described above. The system produces a ranked list of sentences, from which the
extract is constructed by arranging the highest-scoring sentences until the
desired size is reached.
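
The scoring and selection step can be sketched as follows. Two choices are assumptions rather than statements from the paper: metrics are brought into [0, 1] by min-max normalization, and the selected sentences are emitted in document order. The function names are hypothetical.

```python
def normalize(values):
    # Min-max normalization into [0, 1]; a constant metric carries no information.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_sentences(metrics):
    """metrics: dict mapping a metric name to one raw score per sentence."""
    normalized = [normalize(scores) for scores in metrics.values()]
    # Equally weighted linear combination: the mean of the normalized metrics.
    combined = [sum(col) / len(normalized) for col in zip(*normalized)]
    order = sorted(range(len(combined)), key=lambda x: combined[x], reverse=True)
    return order, combined

def build_extract(sentences, order, size):
    chosen = sorted(order[:size])  # restore document order (an assumption)
    return [sentences[x] for x in chosen]

order, combined = rank_sentences({"F": [2, 3, 3], "I": [4, 7, 4]})
extract = build_extract(["s0", "s1", "s2"], order, 2)
```

Since all weights are equal, no training corpus is needed; tuned weights would replace the plain mean in `rank_sentences`.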



3 Experimental Settings 

Considerable interest has been expressed, and effort expended, in attempting to
evaluate the quality of summaries automatically. There exist two different
types of evaluation: extrinsic and intrinsic [15]. Extrinsic evaluations measure the
quality of a summary based on how it affects certain tasks. In intrinsic
evaluations, a summary's quality is evaluated by an analysis of its content. Most
existing automated evaluation methods work by comparing the produced summaries to
one or more reference summaries (ideally, produced by humans). In order to
evaluate our system, we have collected a testing set from http://pubs.acs.org.
The testing set is composed of 100 article/abstract pairs coming from different
journals (Organic Letters, Accounts of Chemical Research and Journal
of Organic Chemistry) of different years (respectively 2000-2002, 2005-2007 and
2007-2008), with different authors and topics. Each document has been cleaned up
manually from the PDF (or HTML) version (figures, bibliographic references,
special characters, etc. have been removed). By way of comparison, the corpus
used in the Document Understanding Conference (DUC, see footnote 7) 2005
competition was also composed of 100 sets. Table 4 shows some statistics about
the testing set.



3.1 Performance Measures 

To evaluate the quality of our generated summaries, we chose to use the ROUGE
(see footnote 8) [16] evaluation toolkit, which has been found to be highly
correlated with human judgments [17]. ROUGE-n is an n-gram recall measure calculated between a

6 Other combinations might be considered, but a large training corpus is required to
tune the parameters.

7 Document Understanding Conferences are competitions on text summarization
conducted since 2000 by the National Institute of Standards and Technology (NIST),
http://www-nlpir.nist.gov

8 ROUGE is available at http://haydn.isi.edu/ROUGE/. 



Journal                           Year       Documents  Sentences    Words
Organic Letters                   2000-2008         63      5,313  104,588
Accounts of Chemical Research     2005-2006         10        979   18,337
The Journal of Organic Chemistry  2007-2008         27      2,631   66,242
Total                                              100      8,923  189,167

Table 4. Testing corpus description.



candidate summary and one or more reference summaries. In our experiments,
ROUGE-1, ROUGE-2 and ROUGE-SU4 are computed. Each generated extract
is evaluated by comparison with the author's abstract. The size of the produced
extracts is set to 5% of the original document (in number of sentences), with
a minimum of three sentences.
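
To make the measure concrete, ROUGE-n recall against a single reference and the extract size rule can be sketched as below. This is a simplified illustration and not the ROUGE toolkit: it ignores the toolkit's options and does not cover ROUGE-SU4. Rounding the 5% down to an integer is also an assumption, as the paper does not state the rounding rule.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Share of the reference n-grams that also occur in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clipped counts: an n-gram is matched at most as often as the candidate has it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def extract_size(num_sentences, percent=5, minimum=3):
    # Integer arithmetic avoids floating-point surprises; rounding down is assumed.
    return max(minimum, num_sentences * percent // 100)

candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
r1 = rouge_n_recall(candidate, reference, 1)
r2 = rouge_n_recall(candidate, reference, 2)
```

Because the measure is recall-oriented, fixing the extract size for all systems keeps a summarizer from improving its score simply by producing longer output.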



4 Results 



The first experiment focuses on the study of the metrics. Figure 1 shows the
ROUGE results of each metric alone and of their combination. As we can see from
these results, the combination, denoted by ah, always outperforms the best metric
alone. The most discriminant metrics are the similarity measures with the
title ($Jw_e$ and cosine) and the interaction relationship between sentences (I).
The title similarity measures focus the summary on the document's main
topic, as delineated by the author. The similarity measure $Jw_e$ that we propose
is globally the most discriminant metric: its ability to bridge morphologically
similar words is well suited to Organic Chemistry documents. The interaction
metric uses the networks built by words within the document to compute a relevance
score: sentences that are constructed with terms appearing in many other
sentences are selected. These sentences are judged to be the most representative
of the document because they contain most of the information.

Fig. 1. ROUGE-1, ROUGE-2 and ROUGE-SU4 recall scores for each metric independently
and for their combination (denoted ah).



A second evaluation compares YACHS to a generic statistical summarizer and
a baseline on the corpus of manually segmented documents (see Figure 2). We
use the Cortex summarizer [14], which is based on the same approach as YACHS,
namely a combination of relevance metrics, but without the chemical compound
detection process and the powerful $Jw_e$ metric. The baseline is generated by
arranging n sentences selected randomly from the document, n being 5% of
the number of sentences in the document, with a minimum of three sentences. In
order to smooth the baseline results, the average of 100 baseline evaluations is
used in our experiments. YACHS achieves the best results in all the ROUGE
evaluations. This confirms that the specialized pre-processing and sentence
scoring are well suited to processing domain-specialized (Organic Chemistry)
documents.

Fig. 2. ROUGE-1, ROUGE-2 and ROUGE-SU4 recall scores of YACHS, Cortex and
the random baseline.



The last evaluation models a real-world summarization task: a plain text is
given as input (without manual sentence segmentation), and each summarizer has to
produce an extract whose size equals 5% of the original document (in number of
sentences). We compare YACHS to six extractive summarizers and one baseline;
results are shown in Figure 3. YACHS, Cortex and the baseline use the same
automatic sentence segmentation process, which consists of a standard sentence
boundary detection system enriched with lists of abbreviations. The other systems
use their own sentence splitters. The baseline is generated by arranging
n sentences selected randomly from the document, n being 5% of the number of
sentences in the document, with a minimum of three sentences. Again, the average
of 100 baseline evaluations is used in our experiments. MEAD (see footnote 9) is
a centroid-based summarizer [18] that extracts sentences according to three
features: sentence centrality within the cluster, sentence position within the
document and weighted similarity with the title. Open Text Summarizer (OTS) [19]
is an Open Source project that, similarly to MEAD, uses statistical word-frequency
methods to score sentences that are parsed beforehand. It also incorporates an
English language lexicon with synonyms and cue terms. Pertinence Summarizer (see
footnote 10) performs linguistic processing of a document to generate an extract,
with a sentence scoring method that considers general and specialized (Chemistry)
linguistic markers. In addition, two frequency-based summarizers are evaluated:
the Copernic summarizer (see footnote 11) and the AutoSummarize feature of
Microsoft Word. Exact details of their algorithms are unfortunately not documented.

9 Available at http://www.summarization.com/mead/

Fig. 3. Comparison of the ROUGE-1, ROUGE-2 and ROUGE-SU4 recall scores for
the seven summarizers and the random baseline.



YACHS and Cortex clearly stand out from the crowd. A significant score
margin separates these two systems from the others, confirming that these
statistical techniques work well for Organic Chemistry documents. YACHS achieves
the best results among all summarizers, indicating that specialized pre-processing
and adapted sentence scoring make it possible to generate better specialized
extracts.



5 Conclusion 

In this paper we have described an efficient approach for automatically
generating extracts from documents in Organic Chemistry. Through experiments
performed on a corpus composed of scientific articles, we have shown that our
approach (implemented in the YACHS system, see footnote 12) achieves promising
results. This work represents a good starting point but reveals a critical issue:
a lot of information is lost during document pre-processing. Indeed, pictures,
tables and captions, which are removed during PDF (or HTML) to text conversion,
contain salient information that could be used to enhance extracts. In
addition, several points would be worthy of further investigation:

10 Available at http://www.pertinence.net/ps/ 

11 Available at http://www.copernic.com/en/products/summarizer/index.html 

12 A demonstration version of YACHS is available at http://daniel.iut.univ-metz.fr/yachs



— Use multi-media information (i.e. pictures, texts, tables, etc.) to generate 
extracts. 

— Fuse text summarization and Question Answering (QA) to model real-world 
complex QA, in which a question cannot be answered by simply stating a 
name, date, quantity, etc. 